Introduce ARM Neon and SSE2 SIMD. #743

Merged
merged 53 commits into from
Apr 28, 2025

samyron
Contributor

@samyron samyron commented Feb 3, 2025

Version 2 of the introduction of ARM Neon SIMD.

There are currently two implementations:

  1. "Rules" based.
  2. Lookup Table based. This is effectively a SIMD-accelerated version of the scalar implementation.
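
For readers unfamiliar with the two strategies, here is a minimal scalar sketch in C of the per-byte decision each one makes. It assumes the basic (non script-safe) JSON string rules: control characters below 0x20, the double quote, and the backslash need escaping. The names `needs_escape_rules`, `needs_escape_lut`, `escape_lut`, and `first_escape`, and the 0/1 table contents, are illustrative and not taken from the PR.

```c
#include <assert.h>
#include <stddef.h>

/* "Rules" flavour: three comparisons per byte. */
static int needs_escape_rules(unsigned char c) {
    return c < 0x20 || c == '"' || c == '\\';
}

/* Lookup-table flavour: one flag per ASCII byte (hypothetical 0/1 table).
 * Bytes >= 0x80 never need basic escaping, so 128 entries are enough. */
static const unsigned char escape_lut[128] = {
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  /* 0x00-0x0F */
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  /* 0x10-0x1F */
    ['"']  = 1,
    ['\\'] = 1,
};

static int needs_escape_lut(unsigned char c) {
    return c < 128 && escape_lut[c];
}

/* Index of the first byte that needs escaping, or len if none does. */
static size_t first_escape(const unsigned char *s, size_t len) {
    assert(s != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (needs_escape_lut(s[i])) return i;
    }
    return len;
}
```

Both predicates agree on every byte value; a SIMD version applies the same decision to 16 bytes at a time instead of one.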

Benchmarks (Lookup table)

== Encoding mixed utf8 (5003001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json    62.000 i/100ms
          json_coder    67.000 i/100ms
                  oj    30.000 i/100ms
Calculating -------------------------------------
                json    628.035 (±12.7%) i/s    (1.59 ms/i) -      3.162k in   5.118636s
          json_coder    626.843 (±15.8%) i/s    (1.60 ms/i) -      3.082k in   5.079836s
                  oj    352.174 (± 9.4%) i/s    (2.84 ms/i) -      1.740k in   5.005929s

Comparison:
                json:      628.0 i/s
          json_coder:      626.8 i/s - same-ish: difference falls within error
                  oj:      352.2 i/s - 1.78x  slower


== Encoding mostly utf8 (5001001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json    50.000 i/100ms
          json_coder    56.000 i/100ms
                  oj    36.000 i/100ms
Calculating -------------------------------------
                json    632.784 (±27.0%) i/s    (1.58 ms/i) -      3.000k in   5.063991s
          json_coder    628.328 (±16.7%) i/s    (1.59 ms/i) -      3.080k in   5.034271s
                  oj    351.466 (± 9.7%) i/s    (2.85 ms/i) -      1.728k in   5.003977s

Comparison:
                json:      632.8 i/s
          json_coder:      628.3 i/s - same-ish: difference falls within error
                  oj:      351.5 i/s - 1.80x  slower

Benchmarks (Rules based)

== Encoding mixed utf8 (5003001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json    69.000 i/100ms
          json_coder    78.000 i/100ms
                  oj    33.000 i/100ms
Calculating -------------------------------------
                json    758.135 (±22.7%) i/s    (1.32 ms/i) -      3.657k in   5.114664s
          json_coder    800.957 (±11.5%) i/s    (1.25 ms/i) -      3.978k in   5.044465s
                  oj    343.750 (±11.9%) i/s    (2.91 ms/i) -      1.683k in   5.004571s

Comparison:
                json:      758.1 i/s
          json_coder:      801.0 i/s - same-ish: difference falls within error
                  oj:      343.7 i/s - 2.21x  slower


== Encoding mostly utf8 (5001001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json    59.000 i/100ms
          json_coder    53.000 i/100ms
                  oj    37.000 i/100ms
Calculating -------------------------------------
                json    828.807 (±15.1%) i/s    (1.21 ms/i) -      4.071k in   5.060739s
          json_coder    799.688 (±20.1%) i/s    (1.25 ms/i) -      3.816k in   5.019480s
                  oj    364.514 (± 7.1%) i/s    (2.74 ms/i) -      1.850k in   5.100773s

Comparison:
                json:      828.8 i/s
          json_coder:      799.7 i/s - same-ish: difference falls within error
                  oj:      364.5 i/s - 2.27x  slower

I am still working on this but I wanted to share progress.

Edit: Looks like I missed one commit so I'll have to resolve some merge conflicts.

@byroot
Member

byroot commented Feb 3, 2025

The gain seems to be 7% on real-world benchmarks:

== Encoding activitypub.json (52595 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
               after     2.438k i/100ms
Calculating -------------------------------------
               after     24.763k (± 0.8%) i/s   (40.38 μs/i) -    124.338k in   5.021560s

Comparison:
              before:    23166.2 i/s
               after:    24762.5 i/s - 1.07x  faster


== Encoding twitter.json (466906 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
               after   254.000 i/100ms
Calculating -------------------------------------
               after      2.600k (± 1.3%) i/s  (384.61 μs/i) -     13.208k in   5.080852s

Comparison:
              before:     2439.5 i/s
               after:     2600.0 i/s - 1.07x  faster

Also note that I did one more refactoring to make the introduction of SIMD easier, so you still have a conflict.

Comment on lines 20 to 26
uint8x16x4_t load_uint8x16_4(const unsigned char *table, int offset) {
  uint8x16x4_t tab;
  for(int i=0; i<4; i++) {
    tab.val[i] = vld1q_u8(table+offset+(i*16));
  }
  return tab;
}
Member

Contributor Author

Unfortunately it's not. vld4q_u8 interleaves the data among the 4 vector registers.

% cat load-test.c 
#include <stdio.h>
#include <stdint.h>
#include <arm_neon.h>

void print_vec(char *msg, uint8x16_t vec) {
  printf("%s\n[ ", msg);
  uint8_t store[16] = {0};
  vst1q_u8(store, vec);
  for(int i=0; i<16; i++) {
    printf("%3d ", store[i]);
  }
  printf("]\n");
}

uint8x16x4_t load_table(uint8_t *table, int offset) {
  uint8x16x4_t tab;
  for(int i=0; i<4; i++) {
    tab.val[i] = vld1q_u8(table+offset+(i*16));
  }
  return tab;
}

int main(void) {
  uint8_t table[256];
  
  for(int i=0; i<256; i++) {
    table[i] = i;
  }

  uint8x16x4_t tab1 = load_table(table, 0);

  print_vec("tab1.val[0]", tab1.val[0]);
  print_vec("tab1.val[1]", tab1.val[1]);
  print_vec("tab1.val[2]", tab1.val[2]);
  print_vec("tab1.val[3]", tab1.val[3]);

  printf("\n");
  uint8x16x4_t tab1_2 = vld4q_u8(table);
  print_vec("tab1_2.val[0]", tab1_2.val[0]);
  print_vec("tab1_2.val[1]", tab1_2.val[1]);
  print_vec("tab1_2.val[2]", tab1_2.val[2]);
  print_vec("tab1_2.val[3]", tab1_2.val[3]);
  
  return 0;
}
% ./load-test 
tab1.val[0]
[   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15 ]
tab1.val[1]
[  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31 ]
tab1.val[2]
[  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47 ]
tab1.val[3]
[  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63 ]

tab1_2.val[0]
[   0   4   8  12  16  20  24  28  32  36  40  44  48  52  56  60 ]
tab1_2.val[1]
[   1   5   9  13  17  21  25  29  33  37  41  45  49  53  57  61 ]
tab1_2.val[2]
[   2   6  10  14  18  22  26  30  34  38  42  46  50  54  58  62 ]
tab1_2.val[3]
[   3   7  11  15  19  23  27  31  35  39  43  47  51  55  59  63 ]
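
The difference between the two outputs above can be modelled in plain C without NEON. The sketch below is only a scalar stand-in: `bytes16x4`, `load_consecutive`, and `load_deinterleaved` are hypothetical names that mimic four `vld1q_u8` loads and one `vld4q_u8` load respectively.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Plain-C stand-in for uint8x16x4_t: four 16-byte lanes. */
typedef struct { uint8_t val[4][16]; } bytes16x4;

/* Models four consecutive vld1q_u8 loads: val[j] = src[16*j .. 16*j+15]. */
static bytes16x4 load_consecutive(const uint8_t *src) {
    assert(src != NULL);
    bytes16x4 out;
    for (int j = 0; j < 4; j++) memcpy(out.val[j], src + 16 * j, 16);
    return out;
}

/* Models vld4q_u8, a de-interleaving load: val[j][i] = src[4*i + j]. */
static bytes16x4 load_deinterleaved(const uint8_t *src) {
    assert(src != NULL);
    bytes16x4 out;
    for (int i = 0; i < 16; i++)
        for (int j = 0; j < 4; j++)
            out.val[j][i] = src[4 * i + j];
    return out;
}
```

A lookup table needs the first layout, since a table index must map to the byte at that same offset in memory.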

Member

Wow, that's so weird.

Well, maybe that loop should be unrolled then, I suspect the compiler does it, but might as well be explicit.

@byroot
Member

byroot commented Feb 3, 2025

Can you just include the implementation for the regular escaping? I'm not sure the script safe version is quite worth it.

Comment on lines 392 to 398
if ((ch_len = search_escape_basic_neon_advance_lut(search)) != 0) {
  return ch_len;
}

// if ((ch_len = search_escape_basic_neon_advance_rules(search)) != 0) {
//   return ch_len;
// }
Contributor Author

Seems like it's a toss-up which one is best. It might be an artifact of my M1 MacBook Air being passively cooled: it gets warm after I run the benchmarks over and over.

@samyron
Contributor Author

samyron commented Feb 6, 2025

Comparison between master and this branch in real world benchmarks. This is for the lookup table implementation.

== Encoding activitypub.json (52595 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after     2.027k i/100ms
Calculating -------------------------------------
               after     21.413k (± 1.6%) i/s   (46.70 μs/i) -    107.431k in   5.018339s

Comparison:
              before:    14448.8 i/s
               after:    21412.9 i/s - 1.48x  faster


== Encoding citm_catalog.json (500298 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   110.000 i/100ms
Calculating -------------------------------------
               after      1.098k (± 1.2%) i/s  (910.41 μs/i) -      5.500k in   5.007977s

Comparison:
              before:      993.9 i/s
               after:     1098.4 i/s - 1.11x  faster


== Encoding twitter.json (466906 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   216.000 i/100ms
Calculating -------------------------------------
               after      2.086k (± 8.9%) i/s  (479.31 μs/i) -     10.368k in   5.034983s

Comparison:
              before:     1642.1 i/s
               after:     2086.3 i/s - 1.27x  faster

Running it a second time:

== Encoding activitypub.json (52595 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after     2.042k i/100ms
Calculating -------------------------------------
               after     21.400k (± 1.7%) i/s   (46.73 μs/i) -    108.226k in   5.058877s

Comparison:
              before:    15039.4 i/s
               after:    21399.7 i/s - 1.42x  faster


== Encoding citm_catalog.json (500298 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   109.000 i/100ms
Calculating -------------------------------------
               after      1.094k (± 1.2%) i/s  (913.67 μs/i) -      5.559k in   5.079778s

Comparison:
              before:     1005.4 i/s
               after:     1094.5 i/s - 1.09x  faster


== Encoding twitter.json (466906 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   215.000 i/100ms
Calculating -------------------------------------
               after      2.137k (± 5.5%) i/s  (467.91 μs/i) -     10.750k in   5.050467s

Comparison:
              before:     1639.0 i/s
               after:     2137.1 i/s - 1.30x  faster

…e only need 128 bytes for the lookup table as the top 128 bytes are all zeros.
@byroot
Member

byroot commented Feb 6, 2025

Not sure why but it's way more modest on my machine (Air M3):

== Encoding activitypub.json (52595 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
               after     2.603k i/100ms
Calculating -------------------------------------
               after     26.544k (± 1.8%) i/s   (37.67 μs/i) -    132.753k in   5.002890s

Comparison:
              before:    23370.1 i/s
               after:    26543.7 i/s - 1.14x  faster


== Encoding citm_catalog.json (500298 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
               after   136.000 i/100ms
Calculating -------------------------------------
               after      1.368k (± 0.7%) i/s  (730.98 μs/i) -      6.936k in   5.070329s

Comparison:
              before:     1369.9 i/s
               after:     1368.0 i/s - same-ish: difference falls within error


== Encoding twitter.json (466906 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
               after   269.000 i/100ms
Calculating -------------------------------------
               after      2.702k (± 0.3%) i/s  (370.11 μs/i) -     13.719k in   5.077550s

Comparison:
              before:     2475.0 i/s
               after:     2701.9 i/s - 1.09x  faster

@samyron
Contributor Author

samyron commented Feb 10, 2025

Apologies for going dark for a while. I've been trying to make incremental improvements on a different branch (found here). My hope was that using a move mask would be faster than vmaxvq_u8 at determining whether any byte needs to be escaped. It also has the benefit of not needing to store all of the candidate matches, as all that is needed is a uint64_t that indicates which bytes need to be escaped. Unfortunately, on my machine it didn't seem to make much of a difference.

Feel free to try it out though.

@byroot
Member

byroot commented Feb 10, 2025

Apologies for going dark for a while

That's no worries at all. I want to release a 2.10.0 with the current change on master, but I'm pairing with Étienne on making sure we have no blind spots on JSON::Coder. So probably gonna happen this week.

After that I think I can start merging some SIMD stuff. I'd like to go with the smallest possible useful SIMD acceleration to ensure it doesn't cause issues for people. If it works well, we can then go further. So yeah, no rush.

@samyron
Contributor Author

samyron commented Feb 11, 2025

@byroot if you have a few minutes, would you be able to check out this branch and benchmark it against master? You'll have to tweak your compare script a bit to compile this branch with cmd("bundle", "exec", "rake", "clean", "compile", "--", "--disable-generator-use-simd"). I want to see how your M3 compares with my M1.

This branch uses the bit-twiddling, platform-agnostic sort of SIMD code if the SIMD code is disabled via an extconf.rb flag.
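
For context, "bit twiddling" here refers to SIMD-within-a-register (SWAR) tricks on ordinary 64-bit integers. The sketch below is one well-known formulation that checks 8 bytes at once; it is an illustration under that assumption, not the code from this branch, and the names `swar_has_zero`, `swar_has_less`, and `chunk8_needs_escape` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SWAR_ONES  UINT64_C(0x0101010101010101)
#define SWAR_HIGHS UINT64_C(0x8080808080808080)

/* Non-zero iff some byte of x is zero. */
static uint64_t swar_has_zero(uint64_t x) {
    return (x - SWAR_ONES) & ~x & SWAR_HIGHS;
}

/* Non-zero iff some byte of x is below n (valid for n <= 128). */
static uint64_t swar_has_less(uint64_t x, uint8_t n) {
    return (x - SWAR_ONES * n) & ~x & SWAR_HIGHS;
}

/* 1 if any of the 8 bytes at p is a control character, '"' or '\\'. */
static int chunk8_needs_escape(const unsigned char *p) {
    assert(p != NULL);
    uint64_t x;
    memcpy(&x, p, sizeof x);  /* alignment-safe load */
    uint64_t hit = swar_has_less(x, 0x20)
                 | swar_has_zero(x ^ (SWAR_ONES * '"'))
                 | swar_has_zero(x ^ (SWAR_ONES * '\\'));
    return hit != 0;
}
```

The result is only used as a yes/no answer for the whole word, so it does not depend on byte order.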

The results on my M1:

== Encoding activitypub.json (52595 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after     1.944k i/100ms
Calculating -------------------------------------
               after     19.671k (± 2.5%) i/s   (50.84 μs/i) -     99.144k in   5.043309s

Comparison:
              before:    15135.7 i/s
               after:    19670.9 i/s - 1.30x  faster


== Encoding citm_catalog.json (500298 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   113.000 i/100ms
Calculating -------------------------------------
               after      1.109k (± 2.1%) i/s  (901.49 μs/i) -      5.650k in   5.095561s

Comparison:
              before:     1040.1 i/s
               after:     1109.3 i/s - 1.07x  faster


== Encoding twitter.json (466906 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   204.000 i/100ms
Calculating -------------------------------------
               after      2.006k (± 3.8%) i/s  (498.51 μs/i) -     10.200k in   5.092718s

Comparison:
              before:     1687.4 i/s
               after:     2006.0 i/s - 1.19x  faster

@byroot
Member

byroot commented Feb 12, 2025

With that compilation flag and compared to master:

== Encoding activitypub.json (52595 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
               after     2.326k i/100ms
Calculating -------------------------------------
               after     23.218k (± 1.6%) i/s   (43.07 μs/i) -    116.300k in   5.010271s

Comparison:
              before:    22460.3 i/s
               after:    23218.0 i/s - 1.03x  faster


== Encoding citm_catalog.json (500298 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
               after   132.000 i/100ms
Calculating -------------------------------------
               after      1.290k (± 1.4%) i/s  (775.38 μs/i) -      6.468k in   5.016121s

Comparison:
              before:     1323.6 i/s
               after:     1289.7 i/s - same-ish: difference falls within error


== Encoding twitter.json (466906 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
               after   242.000 i/100ms
Calculating -------------------------------------
               after      2.495k (± 0.6%) i/s  (400.84 μs/i) -     12.584k in   5.044306s

Comparison:
              before:     2449.6 i/s
               after:     2494.8 i/s - 1.02x  faster

@samyron
Contributor Author

samyron commented Feb 25, 2025

From a co-worker with an M4 Pro:

== Encoding activitypub.json (52595 bytes)
ruby 3.2.6 (2024-10-30 revision 63aeb018eb) [arm64-darwin24]
Warming up --------------------------------------
               after     2.876k i/100ms
Calculating -------------------------------------
               after     28.251k (± 3.0%) i/s   (35.40 μs/i) -    143.800k in   5.095128s

Comparison:
              before:    24938.2 i/s
               after:    28251.0 i/s - 1.13x  faster


== Encoding citm_catalog.json (500298 bytes)
ruby 3.2.6 (2024-10-30 revision 63aeb018eb) [arm64-darwin24]
Warming up --------------------------------------
               after   154.000 i/100ms
Calculating -------------------------------------
               after      1.516k (± 2.9%) i/s  (659.57 μs/i) -      7.700k in   5.083078s

Comparison:
              before:     1575.4 i/s
               after:     1516.1 i/s - same-ish: difference falls within error


== Encoding twitter.json (466906 bytes)
ruby 3.2.6 (2024-10-30 revision 63aeb018eb) [arm64-darwin24]
Warming up --------------------------------------
               after   295.000 i/100ms
Calculating -------------------------------------
               after      2.933k (± 3.3%) i/s  (340.94 μs/i) -     14.750k in   5.034796s

Comparison:
              before:     2678.2 i/s
               after:     2933.0 i/s - 1.10x  faster

@samyron
Contributor Author

samyron commented Feb 26, 2025

From another co-worker with an M1 Pro:

== Encoding activitypub.json (52595 bytes)
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after     2.166k i/100ms
Calculating -------------------------------------
               after     21.521k (± 1.2%) i/s   (46.47 μs/i) -    108.300k in   5.032957s

Comparison:
              before:    15231.1 i/s
               after:    21521.3 i/s - 1.41x  faster


== Encoding citm_catalog.json (500298 bytes)
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   108.000 i/100ms
Calculating -------------------------------------
               after      1.062k (± 5.5%) i/s  (941.69 μs/i) -      5.400k in   5.103989s

Comparison:
              before:     1013.4 i/s
               after:     1061.9 i/s - same-ish: difference falls within error


== Encoding twitter.json (466906 bytes)
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   219.000 i/100ms
Calculating -------------------------------------
               after      2.061k (±12.8%) i/s  (485.22 μs/i) -     10.074k in   5.040974s

Comparison:
              before:     1677.4 i/s
               after:     2060.9 i/s - 1.23x  faster

@radiospiel
Contributor

@samyron

I just pushed a PR #769 to this repo which also employs SIMD to speed up string escapes. I am really sorry that we both worked in that area at the same time; after I started my work I didn't check back with this repo for a while (and I should have done that.)

I believe the main difference between my PR and yours seems to be that mine supports x86 as well. It does this by using a cross-platform shim, simd.h from Postgres, which comes with implementations for AVX, Neon (and also plain C). Still, on Neon I see somewhat higher gains than those reported here; however, I don't understand where that difference comes from.

I'd like to suggest that we collaborate on getting SIMD support in one way or another. 👋

@samyron
Contributor Author

samyron commented Mar 18, 2025

Hi @radiospiel, I'll take a look at #769. I originally started working on #730 which supports Neon, SSE 4.2 and AVX2 with runtime detection support. The PR got a bit big so I decided to close it and implement each instruction set individually.

Additionally, @byroot refactored the code quite a bit to make the SIMD implementation quite a bit easier. There are two implementations in this PR, one uses a lookup table and the other is rule-based. Both seem to have similar performance on my machine.

On my machine I see an 11%-48% improvement depending on the benchmark. A few of my co-workers saw various speedups depending on their machine.

I should probably mark this PR as "Ready for Review". However, I'm happy to collaborate either on this or your PR.

Edit: oh yeah, there is an old-school bit-twiddling SIMD approach in pure C: #738

@samyron samyron marked this pull request as ready for review March 18, 2025 01:34
@radiospiel
Contributor

radiospiel commented Mar 18, 2025

Thank you, @samyron .

I became painfully aware of the work you did when I tried to merge master into my branch, because the interfaces of the escape functions had been changed; my implementation relies on an "escape me a uchar[] array into an fbuffer" function, which is no longer available with what's in master today :)

The main difference between your approach and mine is that you switch out the search functionality, depending on the availability of SIMD, while I switch out the SIMD primitives instead. This allows me to have working implementations for X86, ARM, and bit-twiddling; but only a handful of primitives are available because NEON and AVX are different, so your approach should allow for per-hardware type optimal implementations.
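
To make the trade-off concrete, here is a hedged sketch of compile-time selection in which each platform gets its own hand-written body behind one function. This is closer to the "switch out the search functionality" style described above; a shim-based design would instead write the body once against a small set of wrapper primitives. `chunk16_needs_escape` and `count_clean_prefix_chunks` are hypothetical names, and none of this is code from either PR.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
/* NEON flavour: three compares OR-ed together, then a horizontal max. */
static int chunk16_needs_escape(const unsigned char *p) {
    uint8x16_t chunk = vld1q_u8(p);
    uint8x16_t low   = vcltq_u8(chunk, vdupq_n_u8(0x20));
    uint8x16_t quote = vceqq_u8(chunk, vdupq_n_u8('"'));
    uint8x16_t slash = vceqq_u8(chunk, vdupq_n_u8('\\'));
    return vmaxvq_u8(vorrq_u8(low, vorrq_u8(quote, slash))) != 0;
}
#elif defined(__SSE2__)
#include <emmintrin.h>
/* SSE2 flavour: no unsigned byte compare, so "c <= 0x1F" is min(c, 0x1F) == c. */
static int chunk16_needs_escape(const unsigned char *p) {
    __m128i chunk = _mm_loadu_si128((const __m128i *)p);
    __m128i low   = _mm_cmpeq_epi8(_mm_min_epu8(chunk, _mm_set1_epi8(0x1F)), chunk);
    __m128i quote = _mm_cmpeq_epi8(chunk, _mm_set1_epi8('"'));
    __m128i slash = _mm_cmpeq_epi8(chunk, _mm_set1_epi8('\\'));
    return _mm_movemask_epi8(_mm_or_si128(low, _mm_or_si128(quote, slash))) != 0;
}
#else
/* Portable fallback with the same contract. */
static int chunk16_needs_escape(const unsigned char *p) {
    for (int i = 0; i < 16; i++)
        if (p[i] < 0x20 || p[i] == '"' || p[i] == '\\') return 1;
    return 0;
}
#endif

/* Shared caller: counts leading 16-byte chunks that need no escaping. */
static size_t count_clean_prefix_chunks(const unsigned char *s, size_t len) {
    assert(s != NULL || len == 0);
    size_t chunks = 0;
    while (len - 16 * chunks >= 16 && !chunk16_needs_escape(s + 16 * chunks)) chunks++;
    return chunks;
}
```

Note how the SSE2 body needs a different trick for the "below 0x20" test than the NEON body; that kind of divergence is what a per-platform implementation can exploit and a fixed shim has to paper over.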

I have a busy week ahead of me, but I will definitely take a look at the end of the week. I will also benchmark on Graviton instances; most ARM server workloads are probably not on an Apple Silicon CPU after all :) Happy to benchmark this PR as well.

Can you share a benchmark script that produces the most useful output for you? I would be especially interested in understanding how you get the "before" and "after" entries in the benchmark output :)

Speaking of benchmarks:

On my machine I see a 11x-48x improvement depending on the benchmark.

This is orders of magnitude more than the numbers posted here. I have seen 48% posted above (on the activitypub test case), so is this a typo (x instead of %)?
The activitypub test case, apparently, lends itself particularly well to SIMD; I see a speedup of ~82% on that (Apple M1).

@samyron
Contributor Author

samyron commented Mar 18, 2025

This is orders of magnitude more than the numbers posted here. I have seen 48% posted above (on the activitypub test case), so is this a typo (x instead of %)? The activitypub test case, apparently, lends itself particularly well to SIMD; I see a speedup of ~82% on that (Apple M1).

Apologies, yes, that was a typo. I'll fix it in the comment above

@radiospiel
Contributor

@samyron I reran benchmarks (link). Both our PRs show a substantial improvement over the baseline, the only significant difference is on short strings.

Encoding Type      json 2.10.2        samyron            radiospiel
strings.ascii      13.046k (± 1.6%)   29.681k (± 1.9%)   33.583k (± 3.0%)
strings.escapes    4.608k (± 1.9%)    10.765k (± 2.2%)   9.681k (± 2.5%)
strings.mixed      32.971k (± 1.4%)   88.580k (± 2.1%)   90.133k (± 3.2%)
strings.multibyte  32.836k (± 2.0%)   89.385k (± 3.0%)   89.475k (± 2.1%)
strings.short      91.819k (± 9.8%)   95.388k (± 2.5%)   133.008k (± 2.6%)
strings.tests      21.350k (± 4.1%)   22.538k (± 2.7%)   22.600k (± 2.5%)

strings.short is a test on a 13-byte string, ("b" * 5) + "€" + ("a" * 5), which is shorter than the size of the SIMD buffer (which in my case is 16 bytes).

I believe such short strings are relevant, because JSON object keys are probably quite often shorter than 16 bytes; my PR applies SIMD for strings of 8 bytes and more (link). (The value of 8 seemed beneficial and looked nice, but I should probably retest this with smaller values.)

Maybe you could support that as well?
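
One generic way to let a 16-byte check handle shorter inputs is to pad the string into a full chunk first. The sketch below shows that idea only; it is not claimed to be how either PR treats short strings. `short_needs_escape` is a hypothetical name, and the scalar `chunk16_needs_escape` stands in for a real SIMD check so the example stays portable.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for a 16-byte SIMD check (scalar here to stay portable). */
static int chunk16_needs_escape(const unsigned char *p) {
    for (int i = 0; i < 16; i++)
        if (p[i] < 0x20 || p[i] == '"' || p[i] == '\\') return 1;
    return 0;
}

/* Check a string of at most 16 bytes by padding it to a full chunk.
 * The padding byte must not need escaping itself, hence ' ' and not 0. */
static int short_needs_escape(const unsigned char *s, size_t len) {
    assert(len <= 16);
    unsigned char buf[16];
    memset(buf, ' ', sizeof buf);
    if (len > 0) memcpy(buf, s, len);
    return chunk16_needs_escape(buf);
}
```

Whether the extra copy pays off for very short keys is exactly the kind of threshold question raised above, so it would need benchmarking.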

@radiospiel
Contributor

radiospiel commented Mar 23, 2025

@byroot we have two competing implementations of the same approach. While mine is probably more beneficial in the short term (because it also supports x86), I believe that @samyron's approach has more future potential, because it allows handcrafted SIMD implementations that are fundamentally different between NEON and SSE2 (and it can certainly be extended to also support shorter strings, see the comment above).

Also, transplanting an x86 implementation from my PR into @samyron's shouldn't be too hard to achieve.

I see the following alternatives:

  • we scrap mine, @.samyron adds support for shorter strings, and, in a follow up we transplant SSE2 into @.samyron's;
  • we merge mine, with the understanding that @.samyron's will be merged in at a later point, with SSE2 support right out of the box; mine will be removed again

What do you all think about that? ☝️

#ifdef ENABLE_SIMD

#if defined(__ARM_NEON) || defined(__ARM_NEON__) || defined(__aarch64__) || defined(_M_ARM64)
#include <arm_neon.h>

This comment was marked as resolved.

@byroot byroot force-pushed the arm-neon-simd-v2 branch from 42744f6 to 51635ad Compare April 27, 2025 10:15
@byroot byroot force-pushed the arm-neon-simd-v2 branch from 51635ad to c999baf Compare April 27, 2025 10:19
@byroot
Member

byroot commented Apr 27, 2025

Alright. I think it looks good to me. I've pushed a small simplification for NEON which I'd like your opinion on. If you think it's OK, then I need to do the same change for SSE2; otherwise we revert back to checking popcount first for NEON (if both algos can be similar it's better).

Other than that I fixed a few typos and added a CI job that disables SIMD.

Once we're settled on that popcount thing, I'll clean up the git history and merge.

@samyron
Contributor Author

samyron commented Apr 27, 2025

Alright. I think it looks good to me. I've pushed a small simplification for NEON which I'd like your opinion on. If you think it's OK, then I need to do the same change for SSE2; otherwise we revert back to checking popcount first for NEON (if both algos can be similar it's better).

Other than that I fixed a few typos and added a CI job that disables SIMD.

Once we're settled on that popcount thing, I'll clean up the git history and merge.

The popcount heuristic was an attempt to avoid (effectively) calling neon_next_match in a tight loop if more than half of the bytes in the chunk need to be escaped. In the macro benchmarks it didn't seem to make much difference, but in the synthetic worst-case benchmarks it seemed to help quite a bit. It's certainly not needed for correctness.

I'm good either way; I'm not sure if we should really focus on optimizing for synthetic worst-case benchmarks. I'm just trying to avoid any case where the SIMD code performs worse than the scalar implementation.

uint8x16_t has_dblquote = vceqq_u8(chunk, dblquote);
uint8x16_t needs_escape = vorrq_u8(too_low, vorrq_u8(has_backslash, has_dblquote));

vandq_u8(needs_escape, vdupq_n_u8(0x1));
Contributor Author

Without the popcount based code, this line isn't necessary.

@samyron
Contributor Author

samyron commented Apr 28, 2025

I pushed a commit to simplify updating search->ptr after all of the characters in chunk are escaped. It does require an extra pointer to be stored in search_state, but it prevents a branch. It doesn't seem to make any difference in performance, but I think the code is a little nicer.

@samyron
Contributor Author

samyron commented Apr 28, 2025

Additionally, if we decide not to use the popcount code, there are further simplifications that can be made. For example, there is no need for the loop in convert_UTF8_to_JSON after search_escape_basic_impl returns non-zero, because the search code will never return more than 1.

@byroot
Member

byroot commented Apr 28, 2025

I'm good either way; I'm not sure if we should really focus on optimizing for synthetic worst-case benchmarks. I'm just trying to avoid any case where the SIMD code performs worse than the scalar implementation.

That makes sense. I also don't think we should optimize for the worst case scenario at the expense of code simplicity.

I can't really imagine it's common to need this much escaping in a 16-character sequence. As long as the macro-benchmarks don't regress, I'm happy.

@byroot byroot force-pushed the arm-neon-simd-v2 branch from d41c593 to 142dce7 Compare April 28, 2025 06:21
@byroot
Member

byroot commented Apr 28, 2025

Alright I have removed the popcount logic.

One thing I wonder now is, do we really need that has_matches property? I don't see how it's different from checking if matches_mask is 0.

@byroot
Member

byroot commented Apr 28, 2025

Never mind, I understand now. We need to have that info when we call back into search_escape_basic, so we know we're done escaping that chunk and can move the pointer up.
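
The role of such a flag can be sketched with a small resumable search. This is a simplified scalar model under assumed semantics, not the PR's search_state: `search_sketch`, `escape_bits16`, and `next_escape` are hypothetical names, and the mask uses one bit per byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const unsigned char *ptr;    /* start of the data not yet scanned     */
    const unsigned char *end;
    const unsigned char *chunk;  /* base of the chunk being drained       */
    uint16_t matches_mask;       /* bit i set: chunk[i] still to escape   */
    int has_matches;             /* a chunk with matches is being drained */
} search_sketch;

static int byte_needs_escape(unsigned char c) {
    return c < 0x20 || c == '"' || c == '\\';
}

static uint16_t escape_bits16(const unsigned char *p) {
    uint16_t bits = 0;
    for (int i = 0; i < 16; i++)
        if (byte_needs_escape(p[i])) bits |= (uint16_t)(1u << i);
    return bits;
}

/* Returns the next byte that needs escaping, or NULL when none remain. */
static const unsigned char *next_escape(search_sketch *s) {
    assert(s != NULL && s->ptr <= s->end);
    if (s->has_matches && s->matches_mask == 0) {
        s->ptr = s->chunk + 16;  /* chunk fully drained: move the pointer up */
        s->has_matches = 0;
    }
    while (!s->has_matches && s->end - s->ptr >= 16) {
        uint16_t m = escape_bits16(s->ptr);
        if (m == 0) { s->ptr += 16; continue; }
        s->chunk = s->ptr;
        s->matches_mask = m;
        s->has_matches = 1;
    }
    if (s->has_matches) {
        int i = 0;
        while (((s->matches_mask >> i) & 1) == 0) i++;
        s->matches_mask &= (uint16_t)(s->matches_mask - 1);  /* drop lowest bit */
        return s->chunk + i;
    }
    while (s->ptr < s->end) {    /* scalar tail, fewer than 16 bytes left */
        const unsigned char *p = s->ptr++;
        if (byte_needs_escape(*p)) return p;
    }
    return NULL;
}
```

In this model an empty mask alone is ambiguous: it can mean "no chunk loaded yet" or "the current chunk has just been drained". The flag tells the two apart on re-entry, so the pointer is advanced exactly once per drained chunk.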

@byroot byroot force-pushed the arm-neon-simd-v2 branch from 142dce7 to e50b5df Compare April 28, 2025 06:25
@byroot
Member

byroot commented Apr 28, 2025

I refactored search_escape_basic_neon and search_escape_basic_sse2 a bit further. I think it's almost at a point where it wouldn't be too hard to have the SIMD bits be parameters (e.g. chunk size, etc.), which would allow a single function for the logic and would make maintaining more SIMD implementations easier (e.g. AVX, or an implementation for script_safe: true).

But this can wait.

I'm satisfied with the current PR; let me know if you don't have anything else to add either.

@byroot
Member

byroot commented Apr 28, 2025

What the hell? Why does bigdecimal suddenly fail to compile on Windows?

@byroot
Member

byroot commented Apr 28, 2025

It's failing on master too now. I suspect GitHub updated the compiler or something -_-.

@samyron
Contributor Author

samyron commented Apr 28, 2025

I'm satisfied with the current PR; let me know if you don't have anything else to add either.

I have nothing more to add at this time. Anything additional in the future can be a follow up.

@byroot byroot merged commit 4900352 into ruby:master Apr 28, 2025
28 of 34 checks passed
@byroot
Member

byroot commented Apr 28, 2025

Thank you both for all the work on this. Now I'll try to update json in ruby/ruby and cross my fingers that it won't break on the extended CI.

byroot pushed a commit to byroot/ruby that referenced this pull request Apr 28, 2025
(ruby/json#743)

See the pull request for the long development history: ruby/json#743

```
== Encoding activitypub.json (52595 bytes)
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after     2.913k i/100ms
Calculating -------------------------------------
               after     29.377k (± 2.0%) i/s   (34.04 μs/i) -    148.563k in   5.059169s

Comparison:
              before:    23314.1 i/s
               after:    29377.3 i/s - 1.26x  faster

== Encoding citm_catalog.json (500298 bytes)
ruby 3.4.2 (2025-02-15 revision ruby/json@d2930f8e7a) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   152.000 i/100ms
Calculating -------------------------------------
               after      1.569k (± 0.8%) i/s  (637.49 μs/i) -      7.904k in   5.039001s

Comparison:
              before:     1485.6 i/s
               after:     1568.7 i/s - 1.06x  faster

== Encoding twitter.json (466906 bytes)
ruby 3.4.2 (2025-02-15 revision ruby/json@d2930f8e7a) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   309.000 i/100ms
Calculating -------------------------------------
               after      3.115k (± 3.1%) i/s  (321.01 μs/i) -     15.759k in   5.063776s

Comparison:
              before:     2508.3 i/s
               after:     3115.2 i/s - 1.24x  faster
```

ruby/json@49003523da
@samyron
Contributor Author

samyron commented Apr 28, 2025

Awesome. Thank you for seeing this through. I know it took a while with some very messy PRs. I'm happy to jump back in and fix issues and/or take this further with different implementations in the future.

There is an additional AVX2 implementation in the original PR that I can re-implement within the new searching code.

Additionally, there is an SSE4.2 instruction that may also be useful: pcmpestri.
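As a rough illustration of what pcmpestri offers here (a sketch, not code from either PR): in "ranges" mode it can report the index of the first byte that falls in any of up to eight inclusive byte ranges, which maps onto the basic JSON escape set (control bytes, `"`, and `\`). The function name `sketch_first_escape16` is hypothetical, the example assumes GCC or Clang on x86-64, and it falls back to a scalar loop elsewhere. Nothing here says anything about whether it would be faster than the SSE2 path.

```c
#include <stddef.h>
#include <string.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <nmmintrin.h> /* SSE4.2 intrinsics, including _mm_cmpestri */

/* Returns the index of the first byte in p[0..len) that needs escaping,
 * or len if there is none. Examines at most 16 bytes. */
__attribute__((target("sse4.2")))
static size_t sketch_first_escape16(const unsigned char *p, size_t len) {
    /* Three inclusive ranges: 0x00-0x1F, '"'-'"', '\\'-'\\' (6 bytes used). */
    static const char ranges[16] = { 0x00, 0x1F, '"', '"', '\\', '\\' };
    unsigned char buf[16] = { 0 };
    if (len > 16) len = 16;
    memcpy(buf, p, len); /* copy so the 16-byte load never reads past p + len */

    __m128i set = _mm_loadu_si128((const __m128i *)ranges);
    __m128i data = _mm_loadu_si128((const __m128i *)buf);
    /* Explicit lengths: only the first 6 bytes of set and len bytes of data
     * are considered. The result is 16 when no byte matches. */
    int idx = _mm_cmpestri(set, 6, data, (int)len,
                           _SIDD_UBYTE_OPS | _SIDD_CMP_RANGES | _SIDD_LEAST_SIGNIFICANT);
    return (size_t)idx < len ? (size_t)idx : len;
}
#else
/* Scalar fallback with the same contract, for other targets. */
static size_t sketch_first_escape16(const unsigned char *p, size_t len) {
    if (len > 16) len = 16;
    for (size_t i = 0; i < len; i++) {
        if (p[i] < 0x20 || p[i] == '"' || p[i] == '\\') return i;
    }
    return len;
}
#endif
```

The unsigned-byte mode matters: with signed comparison, UTF-8 bytes at or above 0x80 would be treated as negative and could fall inside the control-character range.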

@byroot
Member

byroot commented Apr 28, 2025

There is an additional AVX2 implementation in the original PR that I can re-implement within the new searching code.

Sure.

Additionally, there is an SSE4.2 instruction that may also be useful

I don't know how many implementations it's really worth having. I think it makes sense to have SSE2 as the baseline that the overwhelming majority of x86-64 CPUs will have, and then probably one more that sits in a sweet spot between efficiency and availability. E.g. I'm not sure it's worth doing AVX-512, given it's not even in some newly released CPUs.

So we can probably include an SSE4.2 implementation or an AVX2 implementation, but having both wouldn't be worth it, I think.

@byroot
Member

byroot commented Apr 28, 2025

and cross fingers it won't break on the extended CI.

Ok, so it broke at least the WASM CI and the i686 one: ruby/ruby#13194

I'll see what I can do about it.

byroot pushed a commit to byroot/ruby that referenced this pull request Apr 28, 2025
byroot pushed a commit to byroot/ruby that referenced this pull request Apr 29, 2025
byroot pushed a commit to ruby/ruby that referenced this pull request Apr 30, 2025